
feat: Add tell() to OutputStream writers #2998

Merged
kevinjqliu merged 1 commit into apache:main from geruh:tellme on Feb 2, 2026

Conversation

@geruh (Member) commented on Feb 2, 2026

Rationale for this change

Currently, PyIceberg writes one manifest per snapshot operation, regardless of manifest size. To eventually support rolling manifests, we need to be able to track written bytes without closing the file, so that we can roll to a new file once we hit the target size.

Some of this work was done in #650, but we can keep this change simple and add the rolling writers as a follow-up. Conveniently, the underlying streams we support already have a tell() method; we just need to expose it.

With this change in place, the follow-up can do:

with write_manifest(...) as writer:
    writer.add_entry(entry)
    if writer.tell() >= target_file_size:
        ...  # roll to a new file

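For context, here is a minimal sketch of the delegation this kind of change boils down to, assuming the writer wraps a file-like binary output stream; the StreamWriter class and _output_stream attribute below are illustrative, not PyIceberg's actual names:

import io
from typing import BinaryIO

class StreamWriter:
    """Hypothetical writer wrapping a file-like binary output stream."""

    def __init__(self, output_stream: BinaryIO) -> None:
        self._output_stream = output_stream

    def write(self, data: bytes) -> None:
        self._output_stream.write(data)

    def tell(self) -> int:
        # Delegate to the underlying stream, which already tracks
        # how many bytes have been written so far.
        return self._output_stream.tell()

writer = StreamWriter(io.BytesIO())
writer.write(b"manifest entry bytes")
assert writer.tell() == len(b"manifest entry bytes")
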
Are these changes tested?

Yes, added a test :)

Are there any user-facing changes?

No

@kevinjqliu (Contributor) left a comment

nice! lgtm

@kevinjqliu merged commit 7e66ccb into apache:main on Feb 2, 2026
11 checks passed

paultmathew pushed a commit to paultmathew/iceberg-python that referenced this pull request on May 7, 2026

Currently `Table.append(df)` and `Table.overwrite(df)` only accept a
materialised `pa.Table`, which forces callers to load the entire dataset into
memory before writing. This makes pyiceberg unusable for large or unbounded
inputs and has been a recurring complaint (apache#1004, apache#2152, dlt-hub#3753).

Allow `pa.RecordBatchReader` as an alternative input. When a reader is
provided, batches are streamed and microbatched into target-sized Parquet
files via the new `bin_pack_record_batches` helper, then committed in a
single snapshot via the existing fast_append path. Memory is bounded by
`write.target-file-size-bytes` (default 512 MiB) per worker rather than the
full input size.
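
As a rough sketch of the bin-packing idea (the actual `bin_pack_record_batches` helper's signature and size accounting may differ; `nbytes` here measures in-memory Arrow size, used as a proxy for the eventual Parquet file size):

from typing import Iterator, List
import pyarrow as pa

def bin_pack_record_batches(
    reader: pa.RecordBatchReader,
    target_file_size_bytes: int,
) -> Iterator[List[pa.RecordBatch]]:
    """Group streamed batches into bins of roughly the target size."""
    bin_batches: List[pa.RecordBatch] = []
    bin_size = 0
    for batch in reader:
        bin_batches.append(batch)
        bin_size += batch.nbytes
        if bin_size >= target_file_size_bytes:
            yield bin_batches  # flush one target-sized bin
            bin_batches, bin_size = [], 0
    if bin_batches:
        yield bin_batches  # final partial bin

Each yielded bin would then be written to its own Parquet file, with all files committed together in one fast_append snapshot.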

Scope of this PR — unpartitioned tables only. Streaming into partitioned
tables raises NotImplementedError pointing back to apache#2152; partitioned support
needs additional design (high-cardinality partition handling, per-partition
rolling writers) and is tracked as a follow-up. Mirrors iceberg-go#369's
staging — that project shipped unpartitioned streaming first.

Internal note: the implementation buffers up to `target_file_size` of in-
memory RecordBatches before flushing to a Parquet file. A more memory-
efficient rolling-ParquetWriter approach is a planned follow-up that will
benefit from the `OutputStream.tell()` API added in apache#2998.
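
To make that planned follow-up concrete, here is a hedged sketch of the rolling-writer pattern that tell() enables (open_new_stream and the surrounding names are hypothetical, not a committed PyIceberg API):

import io
from typing import Callable, Iterable, BinaryIO

def write_rolling(
    chunks: Iterable[bytes],
    open_new_stream: Callable[[], BinaryIO],
    target_file_size: int,
) -> None:
    """Write serialized chunks, rolling to a fresh file once the
    current stream's tell() passes the target size."""
    stream = open_new_stream()
    try:
        for chunk in chunks:
            stream.write(chunk)
            if stream.tell() >= target_file_size:
                stream.close()              # seal the full file
                stream = open_new_stream()  # roll to the next one
    finally:
        stream.close()

# Example: roll every 16 bytes using in-memory streams.
write_rolling([b"0123456789abcdef"] * 3, io.BytesIO, target_file_size=16)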